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Abstract:  We  develop  a  higher-order  method  for  non-paraxial  beam 

propagation  based  on  the  wide-angle  split-step  spectral  (WASSS)  method 
previously  reported  [Clark  and  Thomas,  Opt.  Quantum.  Electron.,  4T,  849 
(2010)].  The  higher-order  WASSS  (HOWASSS)  method  approximates  the 
Helmholtz  equation  by  keeping  terms  up  to  third-order  in  the  propagation 
step  size,  in  the  Magnus  expansion.  A  symmetric  exponential  operator  split¬ 
ting  technique  is  used  to  simplify  the  resulting  exponential  operators.  The 
HOWASSS  method  is  applied  to  the  problem  of  waveguide  propagation, 
where  an  analytical  solution  is  known,  to  demonstrate  the  performance 
and  accuracy  of  the  method.  The  performance  enhancement  gained  by 
implementing  the  HOWASSS  method  on  a  graphics  processing  unit  (GPU) 
is  demonstrated.  When  highly  accurate  results  are  required  the  HOWASSS 
method  is  shown  to  be  substantially  faster  than  the  WASSS  method. 
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1.  Introduction 

Beam  propagation  methods  (BPM)  are  a  large  class  of  numerical  methods  for  solving  the  scalar 
Helmoltz  equation,  and  are  popular  for  simulating  guided  waves  and  laser  beams,  as  they  are 
typically  both  fast  and  efficient.  Early  methods  made  use  of  the  paraxial  approximation,  which 
greatly  simplifies  the  problem  by  reducing  the  propagation  equation  to  first  order.  However, 
these  methods  were  severely  limited  in  their  application.  Any  beam  profile  containing  spatial 
frequencies  with  angles  greater  than  a  few  degrees,  with  respect  to  the  propagation  axis,  incur 
significant  phase  errors.  Many  methods  have  been  developed  to  drop  the  paraxial  approximation 
and  include  wide-angle  waves.  In  several  of  these,  the  Helmholtz  equation  is  formally  rewritten 
as  a  first-order  differential  equation  which  includes  the  square  root  of  an  operator.  The  square 
root  operator  is  either  approximated  using  real  or  complex  Pade  approximants  and  a  finite- 
difference  or  iterative  method  is  used  to  solve  the  equation  [1,  2],  or  the  analytical  solution  is 
found  which  results  in  an  exponential  of  the  square  root  of  an  operator  that  is  approximated 
with  a  Pade  approximant  [3].  Recently,  Sharma  extended  an  operator-splitting  technique  used 
on  the  paraxial  wave  equation  to  the  non-paraxial  wave  equation  (Helmholtz  equation)  [4].  The 
splitting  allows  diffraction  and  the  refractive  index  variations  to  be  handled  separately.  Various 
numerical  methods  can  be  used  once  the  operator  has  been  split,  such  as  collocation  or  finite- 
difference  [4,  5]. 

In  a  recent  publication,  we  described  a  numerical  beam  propagation  method  that  represents 
the  beam  profile  in  the  basis  of  the  eigenvectors  of  the  Laplacian  operator  and  uses  a  symmetric 
operator  splitting  technique  to  account  for  the  refractive  index  variations,  known  as  the  wide- 
angle  split-step  spectroscopic  (WASSS)  method  [6].  In  general,  the  method  provided  a  two-fold 
speedup  to  the  finite-difference  method  reported  by  Sharma  [4]  This  improvement  could  be 
increased  by  use  of  a  fast  Fourier  transform  algorithm.  Here  we  develop  a  higher-order  WASSS 
(HOWASSS)  method  that  extends  the  approximation  to  higher-order,  providing  a  more  efficient 
method  when  high  accuracies  are  required.  We  apply  the  HOWASSS  method  to  the  problem 
of  waveguide  propagation  to  demonstrate  the  performance  and  accuracy  of  the  method.  An 
additional  performance  enhancement  is  obtained  by  implementing  the  method  on  a  graphics 
processing  unit  (GPU)  using  compute  unified  device  architecture  (CUDA)  technology  from 
NVIDIA™. 

2.  Formulation 

Beam  propagation  in  a  medium  with  a  non-uniform  refractive  index  is  described  by  the  scalar 
Helmholtz  equation, 

d2 

-^\l/(z,r)  +  Vl\i/(z,r)  +  kln2\i/{z,r)  +  kl(n2(z,r)-n2)\i/(z,r)=0,  ( 1 ) 

where  yr  (z.  r)  is  the  complex  scalar  electric  field,  z  is  a  Cartesian  coordinate  in  the  direction 
of  propagation,  r  are  the  transverse  coordinates,  and  is  the  transverse  Laplace  operator. 
As  usual,  ko  is  the  free  space  wavenumber,  n  (z.  r)  is  the  refractive  index  distribution,  and 
n2  =  min  [n2  (z,v  j] .  For  simplicity  we  will  only  consider  the  two-dimensional  (2D)  case,  how¬ 
ever  generalization  to  three  dimensions  (3D)  follows  quite  trivially.  In  two  dimensions,  the 
electric  field  is  denoted  y/  (z,x).  Note  that  x  is  not  necessarily  a  Cartesian  coordinate  but  could, 
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for  example,  be  a  radial  coordinate.  Expanding  y/(z,x)  in  terms  of  the  eigenfunctions  of  the 
transverse  Laplace  operator  we  write 


V  (z,x)  =  X  fli  (z)  <t>i  (x) ,  (2) 

i 

where 

V2<fc  (x)  = -Aftfc  (x)  (3) 

j  <$>*(x)<$>j(x)Ax  =  Si,j.  (4) 

Multiplying  the  2D  form  of  Eq.  (1)  by  < pj  (x),  integrating  over  the  transverse  coordinate,  and 
applying  Eqs.  (3)  and  (4)  we  are  left  with 

d2  r 

-^2aj  (z)  =  -  {kln2  -  tf)aj  (z)  -  Xa<  (z)  J  k0  (”2  (z>x)  -  ”2)  Qj  (x)  &  (x)  dv-  (5) 


To  discretize  the  problem  space,  let  Nx  and  Nz  be  the  number  of  discrete  grid  points  in  x  and 
z  respectively,  and  let  i,j  €  [l,Vv] ,  and  /  £  [  1 ,  /Vj .  We  let  X;  =  tAx+Xo  and  zi  =  /Az  +  zq,  where 
Ax  and  Az  are  the  corresponding  step  sizes.  The  matrix  N  (z)  is  defined  by 

Nij  (z)  =  kl  (n2  (z,Xi)  -  n2)  8jj.  (6) 

The  discrete  eigentransform  matrix,  S  [6],  transforms  a  vector  into  the  spectral  basis,  while  S  1 
transforms  vectors  back  into  the  spatial  basis.  In  the  case  of  Cartesian  coordinates  with  hard 
boundary  conditions  this  matrix  becomes  the  discrete  Fourier  transform  matrix,  and  in  cylin¬ 
drical  coordinates  with  azimuthal  symmetry  and  hard  boundary  conditions,  S  takes  the  form  of 
a  discrete  Hankel  transform  [7].  We  also  must  define  the  constant  matrix  M  with  elements 


Mij  = 


V/ k2n2-X28ij . 


(7) 


Notice  that  if  A?  >  kori  then  Mjj  becomes  imaginary.  If  we  allow  eigenvalues  such  that  A;  >  kon, 
then  the  trigonometric  functions  appearing  in  P  will  become  hyperbolic  (see  Eq.  (25)),  making 
the  method  numerically  unstable.  For  this  reason,  we  will  limit  Nx  to  satisfy  A,  <  kon.  However, 
we  must  include  at  least  enough  eigenfrequencies  to  sufficiently  represent  the  initial  conditions, 
otherwise  large  errors  will  be  incurred.  Using  these  definitions  Eq.  (5)  becomes 


-M2a(z)-SN(z)S“1a(z). 


Now  if  we  define 


A(z) 

H(z) 


a(z) 

M-^afz). 

0  M 

_-M-M-1SN(z)S-1  oj  ’ 


(8) 


(9) 

(10) 


we  can  write  the  Helmholtz  equation  as  a  first  order  vector  differential  equation 

y-A(z)  =H(z)A(z). 


(11) 
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The  exact  solution  to  Eq.  (1 1)  is  given  by  a  Magnus  expansion  [8], 

A  (z)  =  exp  (Qi  (z,z0)+^2  (z,zo)  +  •  •  -)A(zo)  ■ 
The  first  two  terms  of  the  expansion  are 

Oi  (z,zo)  =  /  H(r)  dr 

£h(z,Zo)  =  \  [  /  [H(f!),H(f2)]  df2df!. 


(12) 

(13) 

(14) 


Here  [H  (fj)  ,H(f2)]  is  the  usual  commutator.  Note  that  Eq.  (11)  is  essentially  the  time- 
dependent  Schrodinger  equation,  in  which  context  the  Magnus  expansion  is  more  well-known 
as  the  time-ordered  exponential  operator  [9].  However,  we  will  refer  to  Eq.  (12)  as  a  Magnus 
expansion  to  be  consistent  with  discussions  concerning  initial  value  problems  of  the  form  of 

Eq.  (11). 

H  ( z )  can  be  written  as  the  sum  of  two  matrices,  one  that  is  constant  and  one  that  depends  on 

z, 

'  0  Ml  r  0  0" 

-M  oj  +  [-M-'SNfcJS-1  0 

Using  this  and  the  trapezoid  rule  to  approximate  the  integral  in  Eq.  (13)  we  obtain, 

,  A z 


H(z)=H1+H2(z)  = 


(15) 


(zt+i)Z/)  =HtAz+  (H2  (z/+i)  +  H2  (z/))y  +  ^((Az)3)  . 


(16) 


To  approximate  Q2  (z/+i  ,Z/)  we  first  need  to  compute  the  commutator.  After  making  use  of  Eq. 
(15)  and  simplifying  we  find 


[H(zi),H(z2)]  = 


S(N  (zi)  —  N(z2))S-1 


0 


0 


M-1S(N(z2)-N(z1))S-1M_  ' 


(17) 


Note  that  [H2  (zi ) ,  H2  (z2)]  =  0.  We  apply  the  trapezoid  method  twice  to  approximate  the  double 
integral  in  Eq.  (14)  and  obtain 


^2  (z/+i,z/)  =  [H(z/+1),H(z/)] 


(Az)2 

8 


((Az)3). 


(18) 


The  next  term  in  the  Magnus  series  will  contribute  a  factor  of  (Az)3  as  it  involves  a  triple 


integral.  Keeping  terms  up  to  (Az) 
( 


A  (z/+i)  ~  exp  ^HiAz+  (H2  (z/+i)  +H2  (z/)  )  y  +  [H(zz+1)  ,H  (z/)]  yy  j  A  (z;) .  (19) 


Because  this  contains  the  exponential  of  a  dense  matrix,  which  is  difficult  to  handle  numerically, 
we  will  split  this  exponential  into  a  form  that  will  be  easier  to  work  with  via  the  symmetric 
exponential  splitting  technique 


Az\  (  A  z 

exp  ((A  +  B)  Az)  =  exp  (  A^  1  exp  (BAz)exp  (  A  — 


'  ((Az)3)  •  (20) 

First,  we  split  the  Az  terms  in  the  exponential  from  the  (Az)“  terms.  Then  we  split  the  terms 

(21) 


involving  Hi  from  those  involving  H2,  and  we  are  left  with 

A  (z/+i)  «  PQ  (z/+i)  Q  (z/)  PC  (zz+i ,zz) PQ  (z/+i)Q  (z,) PA  (z,) , 
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where 


P  =  exp(Hl^ 

Q  (z)  =  exp  ^H2  (z)^j 

¥)■ 


C(zi,z2)  =  exp  (  [H(zi),H(z2)] 


(22) 

(23) 

(24) 


Each  operator  P,  Q(z),  and  C(zi,z2)  can  be  expressed  by  a  2  x  2  block  matrix  where  the 
matrices  within  each  operator  are  diagonal.  This  allows  the  exponential  operators  to  be  simply 
evaluated  by  expanding  the  exponential  into  its  power  series 


'cos(Mf)  sin(Mf) 
—  sin(Mf)  cos(M^) 

I  0 

— M-1SN  (z)  S_1  ^  I 


C(zi,z2)  = 

Sexp((N(z1)-N(z2))^)S“1 

0 


(25) 

(26) 


0 

M-1Sexp((N(z2)-N(zi))^)s-1M 


(27) 


Now  all  matrices  inside  of  functions  are  diagonal  making  them  quite  simple  to  numerically 
evaluate.  In  this  form  the  physical  interpretation  of  these  operators  is  most  transparent.  P  is 
the  operator  corresponding  to  a  propagation  of  the  pulse  through  a  homogenous  medium  with 
an  index  of  refraction  of  «.  Q(z)  takes  into  account  the  difference  n(z,x)  —  n,  while  C(zi,z2) 
depends  on  the  change  in  the  refractive  index  over  the  step  taken.  Thus,  in  this  sense  we  would 
like  to  draw  the  analogy  that  Q(z)  acts  like  the  constant  term  of  a  Taylor’s  series  expansion,  and 
C(zi  ,z2)  acts  in  a  manner  similar  to  the  first  derivative  term  in  such  a  series.  We  can  save  some 
additional  matrix- vector  multiplications  by  combining  Q  (zi+i )  Q  (z/) 


Qfe+i)Qfe) 


i  o 

-M-1S(N(z,+i)+N(z;))S-1f  I 


(28) 


3.  Numerical  example 

To  test  our  method  we  will  simulate  beam  propagation  through  a  two-dimensional  (Cartesian) 
symmetric  Epstein-layer  waveguide  tilted  at  an  angle  9  from  the  positive  z  axis  [5] .  Hard  bound¬ 
ary  conditions  where  i/r(z,xo)  =  0  and  i//(z,x/)  =  0  are  assumed.  The  eigenfunctions  of  the 
transverse  Laplace  operator  are  then  given  by 


< pi  (x)  =  sin  | 


m 


(x/-x o)  J 

The  resulting  eigentransform  matrix  is  given  by  the  discrete  Fourier  matrix 


Su  = 


2  -»J7l(l+])U+])  |  =5 


Nr  +  1 


Nx+  1 


1 

i,j  ‘ 


(29) 


(30) 
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Table  1 .  Parameters  used  in  the  numerical  calculations 


h  =  2.1455 

An  =  0.003 

*0  =  OjUm 

xf  = 

300  jUm 

w  =  5  jum 

kg  =  4.88128  jlim'1 

zo  =  0/im 

Zf  = 

100 /im 

This  particular  Fourier  matrix  was  chosen  because  the  hard  boundary  conditions  at  xg  and 
Xf  are  automatically  satisfied.  We  do  not  need  to  include  these  points  in  our  grid,  defined  by 
Ax  =  (xf  —  x  o)  /  (Nx  +  1 )  andAz  =  (z/  —  zo)  /Nz.  The  refractive  index  profile  is 


n  (, z,x ) 


\ 


«2  +  2n(An)sech2 


2(lcos  (0)  —  z  sin  (0) ) 


(31) 


where  x  is  a  shifted  coordinate  to  align  the  waveguide  in  the  center  of  our  computational  re¬ 
gion  to  make  the  hard  boundary  conditions  as  negligible  as  possible.  An  is  the  height  of  the 
refractive  index  shift,  w  is  the  width  of  the  waveguide.  The  shifted  coordinate  is  given  by 
x  =  x  —  (1/2)  (xf—x o)  +  (1/2)  (zf  —  zo)  tan  (0).  The  initial  electric  field  is  given  by  the  ze¬ 
roth  order  mode  of  the  Epstein-layer  waveguide 

\ff(0,x)  =  sechw  ^-^cos  (0)  ^  eXp  (ig0xsin(fl)),  (32) 

Here,  i  is  the  unit  imaginary  number,  not  to  be  confused  with  the  counting  index  used  elsewhere. 
W  and  Kg  are  given  by 


W  =  —  (^J\  +2w2kQiiAn  —  l^j 

The  exact  solution  for  this  tilted  Epstein-layer  waveguide  is  [10] 


(33) 

(34) 


We  {z,x)  =  sech 


w 


2(lcos  (0)  —  zsin  (0) ) 


exp 


^Xo(xsin(0) +zcos(0))^ .  (35) 


To  measure  the  error  we  use  the  correlation  factor, 


Error  (z) 


(f*o  V*  (z,x)yf(z,x)  dr) 
[fxo  We  (z,x)We(z,x)  d\j 


(36) 


as  this  provides  a  measure  for  both  the  profile  shape  and  amplitude  of  the  beam  [5,  6], 

C  and  CUDA  versions  of  both  the  WASSS  and  HOWASSS  methods  were  implemented.  The 
code  was  compiled  using  nvcc  version  4.2  for  CUDA  code  and  gcc  version  4.6.3  for  the  C  code. 
The  code  was  run  on  a  workstation  equipped  with  an  Intel  Xeon  X5960  CPU,  which  is  a  hyper- 
threaded  six-core  CPU  clocked  at  3.47  GHz  with  48  GB  of  RAM,  and  a  NVIDIA  QUADRO 
6000  GPU,  which  has  448  CUDA  cores  at  574  MHz  core  clock  speed  (750  MHz  memory  clock 
speed)  and  6  GB  of  dedicated  GPU  DDR5  RAM. 

The  simulation  was  run  using  the  parameters  listed  in  Table  1 .  The  value  of  kg  was  selected 
to  ensure  that  we  could  include  1000  transverse  modes  while  still  keeping  M  real- valued.  When 
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Az  =  0.50  urn 
(b)  HOWASSS  Az  =  0.25  Jim 
10'1  -  e-0°  Az  =  0.10  |im 

Az  =  0.05  pm 
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Fig.  1 .  Error  as  a  function  of  propagation  distance  for  an  aligned  waveguide  with  Nx  =  1000 
for  the  (a)  WASSS  and  (b)  HOWASSS  methods.  Note  that  we  have  sampled  the  error  every 
0.5  mm,  and  not  at  every  z  step  calculated,  because  the  rapid  oscillations  would  make  the 
graph  difficult  to  read  otherwise. 


the  waveguide  was  aligned  (9  =  0°)  roughly  an  order  of  magnitude  decrease  in  the  error  was 
observed  with  the  HOWASSS  method  over  the  WASSS  method  (see  Fig.  1).  However,  the 
improvement  was  more  pronounced  when  we  tilted  the  waveguide  at  an  angle  of  50  degrees, 
especially  for  large  step  sizes  (see  Fig.  2).  Figure  3  shows  the  calculated  electric  field  next  to 
the  exact  solution  to  further  illustrate  the  accuracy  of  the  HOWASSS  method.  The  simulations 
shown  in  Figs.  1,  2,  and  3  were  run  using  double  precision  accuracy  numbers  on  the  GPU.  Both 
methods  were  capable  of  producing  accurate  results  using  single  precision  arithmetic,  until 
about  Az  =  0.05  jd  m .  At  this  point,  round-off  error  began  to  affect  the  results  of  the  HOWASSS 
method  due  to  the  increased  number  of  matrix-vector  multiplications  required  in  each  time  step. 
The  WASSS  method  does  not  suffer  as  quickly  from  this  problem  because  it  is  computationally 
more  simple.  The  limit  where  round-off  error  begins  to  affect  the  double  precision  code  was 
not  observed  for  either  method. 


10° 

10'1 
O 

w  10'2 
£ 

1  10'3 
-Q 
< 

10'4 
10'5 
10'6 

0  20  40  60  80  100  0  20  40  60  80  100 

z  (nm)  z  (pm) 

Fig.  2.  Error  as  a  function  of  propagation  distance  for  a  waveguide  rotated  50  degrees  with 
Nx  =  1000  for  the  (a)  WASSS  and  (b)  HOWASSS  methods. 


(b) 

8  =  50° 


Az  =  0.50 
Az  =  0.25  |im 
Az  =  0.10  jim 
Az  =  0.05  nm 


r- 


HOWASSS 


#187594  -  $15.00  USD  Received  25  Mar  2013;  revised  24  May  2013;  accepted  14  Jun  2013;  published  25  Jun  2013 
(C)  2013  OSA  1  July  2013  |  Vol.  21,  No.  13  |  DOI:10.1364/OE.21.015815  ]  OPTICS  EXPRESS  15821 


Distribution  A:  Approved  for  public  release,  distribution  unlimited  (approval  given  by  Public  Affairs  Office  TSRL-PA-12-0068 


I¥l(z,x)  (a.u) 


I'PgKz^)  (a.u) 


80 


?  60 
n  40 

20 

0 


1 

0.9 

100 

0.8 

0.7 

80 

0.6 

? 

60 

0.5 

d. 

0.4 

0.3 

N 

40 

0.2 

0.1 

20 

0  0 


1 

0.9 

0.8 

0.7 

0.6 

0.5 

0.4 

0.3 

0.2 

0.1 

0 


0  50  100  150  200  250  300 


0  50  100  150  200  250  300 


x  (|im)  x  (|im) 

Fig.  3.  Plot  of  \y{z,x)\  with  Nx  =  1000,  Nz  =  2000,  and  d  =  50°  for  the  (a)  HOWASSS 
and  (b)  exact  solutions  for  a  waveguide  tilted  at  50  degrees. 


4.  Discussion 

4.1.  Stability 

By  virtue  of  being  a  higher  order  method,  the  HOWASSS  method  has  improved  stability  over 
the  WASSS  method  (see  Fig.  4).  As  we  increase  the  tilt  of  the  waveguide,  the  WASSS  method 
performs  more  poorly  as  the  angle  increases,  and  at  large  step  sizes  even  shows  signs  of  being  a 
bit  unstable.  However,  the  HOWASSS  method  actually  becomes  more  accurate  at  larger  angles. 
Traditionally,  beam  propagation  methods  struggle  to  obtain  accurate  results  when  there  are 
rapid  changes  in  the  index  of  refraction.  To  simulate  a  rapidly  changing  index  of  refraction 
we  increase  the  change  in  the  index  between  the  waveguide  and  the  surrounding  medium.  An, 
while  keeping  the  width  of  the  waveguide  fixed.  We  see  that  at  a  50°  waveguide  tilt  the  WASSS 
method  actually  becomes  unstable  with  a  large  stepsize,  while  the  HOWASSS  is  able  to  remain 
stable  for  at  least  an  additional  order  of  magnitude  increase  in  An.  However,  both  methods  do 
loose  accuracy  as  the  change  in  the  refractive  index  becomes  steeper,  but  by  decreasing  A z 
accurate  results  can  still  be  obtained  in  a  reasonable  run  time.  This  same  analysis  was  done  for 
a  0°  tilted  waveguide  and  the  results  were  quite  similar  and  so  are  not  shown  here. 
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Fig.  4.  (a)  The  maximum  error  obtained  as  a  function  of  waveguide  tilt  angle  showing  that 
the  HOWASSS  method  actually  gains  a  small  amount  of  accuracy  at  larger  angles,  (b)  The 
maximum  error  obtained  as  a  function  of  waveguide  depth,  An. 
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Nx  Nx 


Fig.  5.  (a)  Run  times  of  HOWASSS  and  WASSS  methods  on  the  GPU  and  the  single-core 
CPU  for  1000  propagation  steps,  (b)  Comparison  of  FIOWASSS  method  run  times  using 
different  number  of  cores  on  the  CPU  and  using  the  GPU  for  1000  propagation  steps. 


4.2.  Speed 

Computationally,  the  HOWASSS  method  requires  the  multiplication  of  matrices  with  vectors. 
If  we  restrict  Nx  so  that  the  matrix  M  is  real  then,  for  the  example  in  Section  3,  all  the  matrices 
will  be  real,  however  the  vector  A  will  be  complex.  In  general,  the  eigenfunctions  and  the 
refractive  index  could  be  complex.  However,  for  simplicity  we  will  assume  these  to  be  real. 

Diagonal  matrix  multiplication  is  equivalent  to  element-by-element  vector  multiplication. 
Hence,  multiplying  an  Nx  x  Nx  diagonal  matrix  by  an  Nx  element  complex  vector  requires  2NX 
operations.  Multiplying  an  Nx  x  Nx  dense  matrix  by  an  Nx  element  complex  vector  is  equivalent 
to  performing  2 Nx  dot  products,  each  requiring  Nx  multiplications  and  Nx  —  1  additions,  for 
a  total  of  2NX  (2 Nx  —  1)  =  4NX  —  2 Nx  operations.  Applying  P  requires  a  total  of  4  diagonal 
matrix-vector  multiplications  for  a  total  of  8 Nx  operations.  Applying  Q  (zix  \ )  Q  (zi)  requires  2 
diagonal  matrix-vector  multiplications,  2  dense  matrix-vector  multiplications,  2  vector-vector 
additions,  and  1  scalar-vector  multiplication  for  a  total  of  8NX+6NX  =  2NX(4NX  —  3)  operations. 
Applying  C(zi+i,zi)  requires  4  dense  matrix-vector  multiplications,  1  vector-vector  addition, 
and  2  element-by-element  exponential  operations  (assumed  to  only  be  1  operation  per  element) 
for  a  total  of  16NX  —  2 Nx  =  2NX(SNX  —  1)  operations.  P  is  applied  four  times,  Q  (z/+i)  Q  (zi) 
is  applied  twice,  and  C(z/+i,z/)  is  applied  once  each  step,  giving  a  total  of  32NX  +42 Nx  = 
2NX ( 1 6NX  +  21)  operations.  Following  the  same  logic  for  the  WASSS  method,  we  find  that 
it  requires  8 Nx  +  6 Nx  =  2NX (4NX  +  3)  operations.  Note  that  this  operation  count  differs  from 
the  one  reported  by  Clark  and  Thomas  [6]  because  we  are  counting  both  multiplications  and 
additions  in  this  calculation. 

Furthermore,  to  leading  order  in  Nx,  the  HOWASSS  method  is  approximately  a  factor  of 
4  slower  for  a  given  A z  (see  Fig.  5).  However,  the  initial  hypothesis  was  that  a  significant 
improvement  in  accuracy  may  lead  to  a  more  efficient  algorithm.  To  test  this  hypothesis,  we 
executed  the  HOWASSS  and  WASSS  methods  using  the  GPU  implementation  with  various 
step  sizes,  holding  all  other  aspects  of  the  problem  constant.  Figure  6  summarizes  the  result. 
For  a  propagation  length  of  100  micrometers,  and  a  waveguide  tilt  angle  of  50  degrees,  we  show 
the  compute  time  per  micron  of  propagation  as  a  function  of  the  maximum  absolute  error.  For 
two  different  values  of  Nx,  the  efficiency  at  constant  error  is  better  for  the  HOWASSS  method, 
as  indicated  by  shorter  compute  times.  In  fact,  for  error  values  smaller  than  about  10  4,  the 
HOWASSS  method  is  much  better,  with  efficiency  rapidly  exceeding  an  order  of  magnitude  for 
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Fig.  6.  Comparison  of  the  FIOWASSS  method  and  WASSS  method  compute  time  per  prop¬ 
agation  distance  with  Nx  =  1000  and  Nx  =  2000.  Both  methods  were  run  on  the  GPU. 


absolute  error  values  of  less  than  10  6. 

The  data  presented  in  this  paper  was  generated  using  double  precision  arithmetic.  This  does 
not  result  in  a  significant  difference  in  the  run  time  when  the  code  is  run  on  a  CPU.  How¬ 
ever,  GPUs  are  intrinsically  designed  to  deal  with  single-precision  arithmetic.  In  order  to  com¬ 
pute  one  double -precision  number  the  GPU  must  perform  two  single-precision  calculations 
and  some  additional  overhead  to  combine  the  result,  leading  to,  at  a  minimum,  a  factor  of  two 
speed-up  when  single -precision  numbers  are  used.  This  speed-up  also  depends  on  the  specific 
graphics  card  used,  as  some  have  less  capability  than  others  when  it  comes  to  double  precision 
arithmetic.  Rounding  errors  can  affect  the  HOWASSS  method  if  Nz  is  large,  around  2000  for  the 
specific  example  considered  in  this  paper.  If  the  application  does  not  require  extreme  accuracy, 
then  a  substantial  speed-up  can  be  achieved  by  using  single-precision  arithmetic  on  the  GPU. 

4.3.  Parallelization 

Both  the  WASSS  and  HOWASSS  methods  readily  lend  themselves  to  parallelization.  For  our 
implementation  we  chose  to  use  both  OpenMP,  to  make  use  of  multi  core  CPUs,  and  NVIDIA 
compute  unified  device  architecture  (CUDA)  to  make  use  of  the  processing  power  of  the  graph¬ 
ics  processing  unit  (GPU).  Modern  GPUs  possess  orders  of  magnitude  more  computational 
power  than  the  typical  CPU.  However,  to  completely  utilize  this  power  the  algorithm  must  pos¬ 
sess  a  very  high  level  of  parallelism  to  completely  saturate  the  many  processing  units  of  the 
GPU. 

Unfortunately,  neither  the  WASSS  nor  the  HOWASSS  method  are  ideal  for  utilizing  the 
full  power  of  the  GPU  because  many  of  the  matrix-vector  multiplications  involve  diagonal 
matrices.  Faster  than  their  dense  counterparts,  these  computationally  reduce  to  simple  element- 
by-element  vector  multiplication.  While  this  is  a  perfectly  parallel  operation,  it  does  not  offer 
the  large  number  of  independent  calculations  needed  to  saturate  the  GPU.  These  element-by- 
element  vector  multiplications  suffer  additionally  from  the  fact  that  they  require  two  memory 
reads  and  one  memory  write  to  compute  only  one  multiplication.  On  the  GPU  reading  and 
writing  to  memory  is  much  slower  than  math  operations.  Given  this,  there  is  still  a  significant 
speed-up  when  the  code  is  run  on  the  GPU  versus  on  a  single  core  of  the  CPU  (see  Fig.  5).  This 
speed-up  was  accomplished  with  little  attention  payed  to  optimization  of  the  code.  With  further 
optimization  an  additional  factor  of  two  or  more  would  be  likely.  Additionally,  extending  this 
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technique  to  even  higher-orders  might  yield  additional  improvements. 

4.4.  Generalizations  to  other  coordinate  systems,  boundary  conditions,  and  higher  dimen¬ 
sions 

Both  the  WASSS  and  HOWASSS  methods  can  be  generalized  to  different  coordinate  systems 
or  different  boundary  conditions.  In  fact  the  only  complication  is  that  the  eigenfunctions  and 
eigenvalues  must  be  known.  For  a  more  detailed  discussion  on  this  topic  see  Clark  and  Thomas 
[6]. 

5.  Conclusion 

We  have  presented  a  method  that  is  more  efficient  than  previous  non-paraxial  beam  propaga¬ 
tion  methods.  The  method  casts  the  analytic  solution  of  the  Helmholtz  equation  as  a  Magnus 
expansion.  Keeping  terms  up  to  (Az)3  in  the  Magnus  expansion  we  use  a  symmetric  operator 
splitting  technique  in  order  to  analytically  reduce  the  exponential  matrices  into  a  more  simple 
form.  The  solution  to  the  Helmholtz  equation  is  then  approximated  via  straightforward  matrix 
multiplication.  We  have  demonstrated  the  method  in  a  simple  geometry  with  2D  Cartesian  co¬ 
ordinates  with  hard  boundary  conditions.  The  results  obtained  show  our  higher-order  approach 
significantly  improves  the  overall  efficiency,  when  measured  as  compute  time  per  distance  of 
propagation,  in  cases  where  high  accuracy  results  are  required.  The  method  can  be  easily  ex¬ 
tended  to  more  generalized  coordinate  systems,  higher  dimensions,  and  various  boundary  con¬ 
ditions,  provided  that  the  eigenfunctions  and  eigenvalues  of  the  transverse  Laplace  operator  can 
be  found  for  that  geometry. 
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